IMMUNOLOGY 



A guide to bioinformatics for immunologists 

Fiona J. Whelan, Nicholas V. L. Yap, Michael G. Surette, G. Brian Golding and Dawn M. E. Bowdish*

Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, ON, Canada

Department of Biology, McMaster University, Hamilton, ON, Canada
Department of Pathology and Molecular Medicine, McMaster University, Hamilton, ON, Canada



Edited by: 

Fabrizio Mattei, Istituto Superiore di
Sanità, Italy

Reviewed by: 

Geanncarlo Lugo-Villarino, Centre 
National de la Recherche Scientifique, 
France 

Can Peng, Tongji University, China 

*Correspondence:

Dawn M. E. Bowdish, Department of
Pathology and Molecular Medicine,
McMaster Immunology Research
Centre, M. G. DeGroote Institute for
Infectious Disease Research,
McMaster University, 1280 Main
Street West, Hamilton, ON L8S 4K1,
Canada

e-mail: bowdish@mcmaster.ca



Bioinformatics includes a suite of methods that are cheap and approachable, many of which are easily accessible without any specialized bioinformatic training. Despite this, bioinformatic tools are under-utilized by immunologists. Herein, we review a representative set of publicly available, easy-to-use bioinformatic tools using our own research on an under-annotated human gene, SCARA3, as an example. SCARA3 shares an evolutionary relationship with the class A scavenger receptors, but preliminary research showed that it was divergent enough that its function remained unclear. In our quest for more information about this gene (did it share gene sequence similarities with other scavenger receptors? Did it contain conserved protein domains? Where was it expressed in the human body?) we discovered the power and informative potential of publicly available bioinformatic tools designed with the novice in mind, which allowed us to hypothesize on the regulation, structure, and function of this protein. We argue that these tools are largely applicable to many facets of immunology research.

Keywords: bioinformatics, immunology, sequence alignments, single-nucleotide polymorphisms, transcriptional profiling, scavenger receptor



INTRODUCTION 

Although public perception holds that bioinformatics is a relatively new discipline born out of the "omics" age, bioinformatics is more than just "data crunching" and, in some form, has been around longer than our understanding of how DNA translates into protein. The term "bioinformatics" was coined in 1970 by Hogeweg and Hesper to mean "the study of informatic processes in biotic systems" (1). In this sense, the interdisciplinary approach characteristic of bioinformatics, a combination of information science, mathematics, and biology, is not a new venture. Even before the term was ever used, Erwin Schrödinger, recognizable for his thought experiments and developments in quantum mechanics (2), gave a series of lectures in war-time Ireland entitled What is Life? (3), encouraging many classically trained physicists and chemists, including Francis Crick and Rosalind Franklin, to turn their interests toward biology. These new recruits became some of the first interdisciplinary scientists. Since then, bioinformatics has been used for a broad range of applications, including the Human Genome Project (4), the discovery of new drugs (5), and further elucidation of Darwin's Tree of Life (6).

Just as bioinformatics can be applied to the study of human genetics and evolution, it can also be used to inform immunology research. This combination of immunology and computational biology is sometimes referred to as "immunomics" or "computational immunology." Bioinformatic techniques have been used to model how major histocompatibility complex (MHC) heterozygosity affects one's interaction with bacteria (7) and the influenza virus (8), how host stress affects the pathogenicity of Pseudomonas aeruginosa in the human gut (9), and why the frequency of staphylococcal-induced toxic stress response is low even though infections by these bacteria are high (10). While some of these investigations require a user to have extensive knowledge of computational science, bioinformatic tools are increasingly equipped with intuitive graphical user interfaces and so are more accessible to those without such a background. Many powerful and informative results can be generated with an Internet connection and a DNA sequence of interest. The plethora of publicly available, easy-to-use bioinformatic tools that investigate nucleotide or protein sequences can provide information about potential post-translational modifications, predict protein structure and gene expression, and document genetic variation within a population, species, or kingdom. Within minutes, information can be generated to guide in vitro experiments, which can save the typical bench scientist both time and resources.

This review uses recent examples of our own quest to seek out 
information on a potential member of the class A scavenger recep- 
tor family, SCARA3, via publicly available bioinformatic tools. 
The scavenger receptors are a family of proteins required for host 
defense and phagocytosis of senescent cells and modified proteins 
(11). Although SCARA3 is a member of this family, there is very 
little information on its structure or function. Through an exam- 
ple of our bioinformatic analyses of the SCARA3 gene, this review 
aims to explain how approachable and accessible bioinformatic 
tools can be used to obtain sequence and structural information, 
gene expression patterns, genetic variation across human popula- 
tions and, most importantly, to generate informed hypotheses that 
can be tested bench-side. 

SEQUENCE ANALYSIS 

ACQUIRING A FASTA SEQUENCE FROM A PUBLIC ONLINE DATABASE 

The FASTA file format was originally described by William R. Pear- 
son as part of his 1990 bioinformatic software package of the same 



www.frontiersin.org



December 2013 | Volume 4 | Article 416 | 1 



Whelan et al. 



Guide to bioinformatics for immunologists 



name (12). Since this time, it has become the de facto file format for most, if not all, bioinformatic sequence analyses. Simply put, a FASTA record is a single description line beginning with a greater-than (">") symbol, followed by the sequence itself in the standard IUPAC nucleotide or protein code.
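
To make the format concrete, the following is a minimal Python sketch of a FASTA reader. It is an illustration only and is not part of any tool described in this review; the function name `read_fasta` and the toy record are assumptions introduced here.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # Sequence lines may be wrapped; join them back together.
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


# Toy record for illustration only (not a real database entry).
EXAMPLE = """>example_protein hypothetical record for illustration
MKVRSAGGDG
DALCVTEEDL
"""

records = list(read_fasta(io.StringIO(EXAMPLE)))
for description, sequence in records:
    print(description, len(sequence))
```

The reader keeps the description and the sequence separate, which is how most downstream tools treat a FASTA record.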

An accurately annotated and appropriately formatted sequence 
of the gene(s) of interest is a prerequisite of many bioinfor- 
matic techniques. Since 2007, the National Center for Biotech- 
nology Information (NCBI) has made the nucleotide sequences 
of more than 260,000 organisms accessible through its publicly 
available database, GenBank (13). GenBank's global coverage of 
sequence data is ensured by daily exchanges of information with 
the European Molecular Biology Laboratory's (EMBL) Nucleotide 
Sequence Database, and the DNA Data Bank of Japan (DDBJ) 
(13). The information stored in GenBank is made accessible through Entrez, NCBI's comprehensive search engine (13). Users of Entrez have the option of searching within specific databases, such as nucleotide and protein sequences, Expressed Sequence Tags (ESTs), and macromolecular structures (14).

One such database is Entrez Gene, which provides gene- 
centered information (15). Entrez Gene includes only those 
gene records corresponding to genomes which have been fully 
sequenced or to genes that have active research groups associated 



with them (15); searches of this or other curated databases avoid 
poor search results. Additionally, because some annotations in 
complete genomes are quite suspect, the use of Entrez Gene 
prevents the use of inappropriately annotated or low quality 
sequences. Searching this database provides useful information such as the "Genomic regions, transcripts, and products" section, which is helpful in visualizing the exonic structure and chromosomal orientation of a gene. The "Bibliography" section summarizes peer-reviewed articles in which the gene is at the forefront. Additionally, a multiple sequence alignment of the gene of interest to known homologs can be generated by choosing the "Homology" section under "General gene information"; this may be of interest to those conducting cross-species or evolutionary studies.

When gathering sequence data, the user should refer to the section entitled "NCBI Reference Sequences (RefSeq)" (Figure 1). Using RefSeqs is important because these sequences meet a stringent standard set by NCBI, including the assurance that supporting evidence for the gene is available (16). Here, at least one set of mRNA and protein sequences will be displayed; isoforms of a given protein are displayed with multiple entries.
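
For readers who prefer a scripted route to the same RefSeq FASTA record, NCBI also exposes its databases through the E-utilities web service. The sketch below only builds the request URL for the `efetch` endpoint and makes no network call; this review itself uses the web interface, so the scripted route and the helper name `efetch_fasta_url` are assumptions added for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities "efetch" endpoint.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession: str, database: str = "protein") -> str:
    """Build an efetch URL requesting one record as plain-text FASTA."""
    params = {
        "db": database,       # e.g. "protein" or "nuccore"
        "id": accession,      # RefSeq accession, e.g. NP_057324.2
        "rettype": "fasta",   # return the record in FASTA format
        "retmode": "text",    # as plain text
    }
    return EFETCH_BASE + "?" + urlencode(params)


# SCARA3 isoform 1 protein accession, as used in Figure 1.
url = efetch_fasta_url("NP_057324.2")
print(url)
```

The resulting address can be opened in a browser or passed to `urllib.request.urlopen`; NCBI's usage guidelines on request frequency apply to any scripted access.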

Although we have chosen to use NCBI's Entrez platform
in this example, it should be noted that there are other equally



[Figure 1 image: screenshots of the "NCBI Reference Sequences (RefSeq)" section of the Entrez Gene record for Homo sapiens SCARA3 (isoform 1: NM_016240.2 / NP_057324.2; isoform 2: NM_182826.1 / NP_878185.1), the protein record for NP_057324.2 (606 aa), and its FASTA formatted protein sequence.]

FIGURE 1 | Retrieval of nucleic acid and protein FASTA formatted sequences from an Entrez Gene search. Upon searching for and selecting the Homo sapiens SCARA3 gene, a variety of information can be retrieved including identifiers for the Ensembl, Mendelian Inheritance in Man (MIM), and Human Protein Reference Database, in addition to information about the genomic context of the gene. From the "NCBI Reference Sequences (RefSeq)" section, the most up-to-date and thoroughly curated FASTA formatted sequences may be obtained. Sequences with Accession Identifiers beginning with NM or XM are mRNA and NP or XP are protein. Multiple RefSeq entries may be present in the case of gene isoforms. Selecting the NP_057324.2 Accession Identifier, information concerning the SCARA3 isoform 1 protein is displayed, including links to publications involving this protein. By selecting "FASTA" at the top of the page, the FASTA formatted sequence is provided, which includes the reference number, species, and name. This sequence is suitable for input into most online bioinformatic tools.
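
The accession prefix rule stated in the caption above can be written as a small lookup. This is an illustrative sketch only; the function name `refseq_molecule_type` is hypothetical, and only the four prefixes named in the caption are handled.

```python
# Prefix-to-molecule mapping, limited to the four prefixes named in the caption.
PREFIX_TO_MOLECULE = {
    "NM": "mRNA",
    "XM": "mRNA",
    "NP": "protein",
    "XP": "protein",
}


def refseq_molecule_type(accession: str) -> str:
    """Return 'mRNA', 'protein', or 'unknown' based on the accession prefix."""
    prefix = accession.split("_", 1)[0].upper()
    return PREFIX_TO_MOLECULE.get(prefix, "unknown")


# Accessions as they appear in Figure 1.
for accession in ("NM_016240.2", "NP_057324.2", "AB007829"):
    print(accession, refseq_molecule_type(accession))
```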



Frontiers in Immunology | Molecular Innate Immunity






Table 1 | Public databases containing DNA, mRNA and protein sequences.

| Acronym | Name | Hosted by | URL | Description | Reference |
|---|---|---|---|---|---|
| GenBank | GenBank | National Center for Biotechnology Information | http://www.ncbi.nlm.nih.gov/genbank/ | An annotated collection of all publicly available DNA sequences (EST gene and transcript sequences and unannotated single read sequences from genome sequencing projects) | Benson et al. (13) |
| EMBL-BANK | EMBL Nucleotide Sequence Database | European Molecular Biology Laboratory (EMBL) | http://www.ebi.ac.uk/embl/ | A collection of DNA and RNA sequences submitted by researchers, genome sequencing projects, and patent applications. In addition to querying individual genes, whole genomes may be browsed | Kulikova (56) |
| DDBJ | DNA Data Bank of Japan | DNA Data Bank of Japan | http://www.ddbj.nig.ac.jp/ | A collection of nucleotide sequences where sequences of recently sequenced genomes are particularly well represented | Miyazaki (57) |
| UCSC | UCSC Genome Bioinformatics site | Genome Bioinformatics Group at the University of California Santa Cruz | http://genome.ucsc.edu/ | Contains reference sequences and working draft assemblies for a large collection of genomes. Source of sequences for genomes that have not been comprehensively sequenced and annotated (e.g., Neandertal) | Kent et al. (58) |





appropriate databases available. Although it is beyond the scope of 
this review to describe them in detail, Table 1 provides an overview. 

PREDICTING POST-TRANSLATIONAL MODIFICATIONS 

Post-translational modifications of a protein can include phos- 
phorylation, glycosylation, ubiquitination, methylation, and lipi- 
dation amongst many others. Post-translational modification may 
change the function, cellular localization, or abundance of a pro- 
tein. Just as understanding protein domains and genomic context 
can inform the function of a protein, understanding how a pro- 
tein is post-translationally modified may provide important clues 
regarding function. For example, signal transduction mediated by
the immunoreceptor tyrosine-based activation motif (ITAM) of
the T-cell receptor requires the phosphorylation of both of its
tyrosine residues [reviewed in Ref. (17)]. Predictions as to which
of the many possible post-translational modifications are statis- 
tically likely in a given protein may explain cellular localization 
patterns, regulation of protein abundance, and indicate whether 
the protein contains specific signaling properties. 

As an example, previous research has demonstrated that the 
prototypical member of the class A scavenger receptors, SRAI, 
has a serine in the cytoplasmic domain of this protein, which, 
when phosphorylated, is essential for its phagocytic function (18, 
19). However, it is not known whether the other members of the 
class A scavenger receptor family, such as SCARA3, contain similar 
sites of post-translational modifications. Knowledge of such sites 
would suggest that SCARA3, like SRAI, is also a phagocytic recep- 
tor whose signaling pathways are conserved within this receptor 
family. The SCARA3 FASTA formatted protein sequence obtained 
from NCBI was analyzed using the NetPhos 2.0 Server (Figure 2). 
This tool was built on the knowledge that the 7 to 12 amino acids neighboring a phosphorylated residue tend to have a specific composition in order to be recognized by specific kinases and phosphatases (20). Using this information, NetPhos predicts sites of phosphorylation in a protein sequence. In the case of SCARA3, multiple serine (S), threonine (T), or tyrosine (Y) residues scored above the threshold probability defined by the software and were therefore predicted to be phosphorylated (Figure 2), suggesting that even though these residues differ from those identified in SRAI, SCARA3 may possess similar functionality.
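
The idea of examining the residues that flank each candidate site, and then keeping only predictions above a score threshold, can be sketched as follows. This is not the NetPhos neural network; it only enumerates S/T/Y residues with their sequence context and filters a list of scores. The window size (`flank=4`, giving a 9-residue context), the function names, and the example scores are assumptions chosen for illustration.

```python
from typing import List, Tuple


def candidate_sites(sequence: str, flank: int = 4) -> List[Tuple[int, str, str]]:
    """List every S, T, or Y as (1-based position, residue, context window)."""
    sites = []
    for index, residue in enumerate(sequence):
        if residue in "STY":
            start = max(0, index - flank)  # clip the window at the N-terminus
            window = sequence[start:index + flank + 1]
            sites.append((index + 1, residue, window))
    return sites


def above_threshold(scored: List[Tuple[int, float]],
                    threshold: float = 0.5) -> List[Tuple[int, float]]:
    """Keep (position, score) pairs whose score exceeds the threshold."""
    return [(pos, score) for pos, score in scored if score > threshold]


# First nine residues of the SCARA3 sequence shown in Figure 1.
sites = candidate_sites("MKVRSAGGD")
print(sites)

# Illustrative (position, score) pairs; not results to be relied upon.
example_scores = [(82, 0.991), (88, 0.196), (92, 0.987)]
kept = above_threshold(example_scores)
print(kept)
```

A real predictor replaces the fixed scores with a model trained on experimentally verified sites; the thresholding step is the same.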

In addition to NetPhos, there are many post-translational modification prediction tools publicly available which require the sole input of a protein sequence. A representative collection of these tools is summarized in Table 2.

IDENTIFYING CONSERVED MOTIFS 

Some regions of a gene are more susceptible to the accumulation 
of mutational change over evolutionary time than others and pro- 
tection from change is largely due to the biological importance of 
such a region (21). Highly conserved regions have generally been shown to encode areas essential for a protein's expression or function, where even slight changes would threaten the organism's survival. In contrast, in other areas of a protein, neutral mutations that do not affect protein function may accumulate over time (21). By examining areas of conservation in a protein of
interest across its orthologs (i.e., genes separated by a speciation 
event; the same gene in different species) and paralogs (i.e., genes 
separated by a gene duplication event; similar genes in the same 
species) one can predict regions that are important for expression 
or function (22). 

This is accomplished by performing sequence alignments. An alignment of sequences, simply put, is the addition of gaps (represented as "-"s) at variable positions in a set of input sequences in order to maximize the number of similar residues per column in the alignment (22). These alignments come in a variety of forms. First, they can be either "pairwise," involving only two sequences, or "multiple," involving more than two sequences. Second, they can be "global," which means the full length of every sequence is aligned, or "local," which means that only the best-matching region is displayed, even if that aligns only a portion of the input sequences to each other (23). The use of pairwise versus multiple sequence alignments depends on how many closely related proteins the user has at their disposal; the more closely related sequences are included, the better informed the alignment will be. However, the choice of local





[Figure 2 image: the NetPhos 2.0 Server submission form with the SCARA3 FASTA sequence (NP_057324.2) pasted in, followed by the output tables of serine, threonine, and tyrosine predictions (columns: Name, Pos, Context, Score, Pred) and a plot of the predicted phosphorylation sites against sequence position.]



FIGURE 2 | Prediction of post-translational modifications in SCARA3. The FASTA formatted sequence of SCARA3 from Homo sapiens was entered into the NetPhos 2.0 Server to predict serine (S), threonine (T), and tyrosine (Y) residues that may be phosphorylated. Each instance of these residues and surrounding sequences are displayed under the "Context" column. Scores above 0.5 are considered to be significant and those residues are highlighted in the "Pred" column with asterisks. The Server also displays the output graphically, including a horizontal line to indicate the 0.5 score threshold. Multiple residues in SCARA3 reach this threshold of significance, and may guide further in vitro analysis of this protein.






Table 2 | A representative collection of bioinformatic tools for post-translational modification (PTM) prediction.

| Name | Hosted by | PTM predicted | URL/Reference |
|---|---|---|---|
| NetCGlyc 1.0 Server | Center for Biological Sequence Analysis (CBS) | C-mannosylation sites in mammalian proteins | http://genome.cbs.dtu.dk/services/NetCGlyc/; Julenius (59) |
| NMT | The Research Institute of Molecular Pathology (IMP) Bioinformatics Group | The MYR predictor for prediction of N-terminal N-myristoylation of proteins | http://mendel.imp.univie.ac.at/myristate/SUPLpredictor.htm |
| PrePS: Prenylation Prediction Suite | The Research Institute of Molecular Pathology (IMP) Bioinformatics Group | Predicts whether a protein is prenylated | http://mendel.imp.ac.at/PrePS/; Maurer-Stroh and Eisenhaber (60) |
| NetPhos 2.0 Server | Center for Biological Sequence Analysis (CBS) | Predictions of phosphorylation sites on serine, threonine, and tyrosine residues | http://genome.cbs.dtu.dk/services/NetPhos/; Blom et al. (20) |
| The Sulfinator | ExPASy Bioinformatics Resource Portal | Prediction of tyrosine sulfation sites | http://web.expasy.org/sulfinator/; Monigatti et al. (61) |
| SUMOplot Analysis tool | Abgent | Predicts the probability of sumoylation sites within a protein sequence | http://www.abgent.com/tools/ |
| ProP 1.0 Server | Center for Biological Sequence Analysis (CBS) | Predicts arginine and lysine propeptide cleavage sites | http://genome.cbs.dtu.dk/services/ProP/; Duckert et al. (62) |
| UBPred | Indiana University, Columbia University, University of California, San Diego, CA, USA | Predicts protein ubiquitination sites | http://www.ubpred.org/; Radivojac et al. (63) |

There are many publicly available PTM prediction tools that require only the input of a protein sequence. This table outlines a representative subset that are available as online tools.



versus global alignments is not as straightforward. The results of 
local alignments are often more meaningful because the method 
emphasizes regions of high similarity between sequences (23). 
These types of alignments are quite informative when compar- 
ing divergent protein sequences that are hypothesized to share a 
specific protein domain. However, often a researcher is interested 
in comparing full-length sequences of high similarity to each other, 
in which case a global alignment must be employed. 
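
The difference between global and local alignment can be illustrated with a short dynamic-programming sketch that computes only the optimal score of each kind. It uses the textbook Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a simple match/mismatch/gap scheme; the scoring values and the function name `alignment_score` are assumptions for illustration, and tools such as ClustalW2 use substitution matrices and more elaborate gap penalties.

```python
def alignment_score(a: str, b: str, local: bool = False,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Optimal pairwise alignment score: global by default, local if requested."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for gaps at the ends of the sequences.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    # Global: score of aligning both full sequences. Local: best-scoring region.
    return best if local else score[-1][-1]


motif = "GPPGPPG"
longer = "AAAAGPPGPPGAAAA"  # the same motif embedded in unrelated flanks
global_score = alignment_score(motif, longer)
local_score = alignment_score(motif, longer, local=True)
print(global_score, local_score)
```

In this toy case the local score rewards the shared motif alone, while the global score is dragged down by the gaps needed to span the unrelated flanking residues, which mirrors the reasoning in the paragraph above.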

In our case, we were interested in the similarities of SCARA3 to 
the other members of the class A scavenger receptors (its paralogs) 
that, to date, have been better characterized in terms of biological 
function and expression. Any similarities between specific regions
of SCARA3 and these well-characterized cousins would allow us
to hypothesize that these regions perform similar functions in
both proteins. As such, we computed a global alignment of the 
human SCARA3 protein with the other four members of this pro- 
tein family (Figure 3). A global sequence alignment is used in this 
case because previous research has suggested that these proteins 
have evolved in parallel for many millions of years, resulting in 
some similar biological functions, suggesting that they share areas 
of similarity across the full lengths of these proteins (11, 24). 

The European Molecular Biology Laboratory's European Bioinformatics Institute (EBI) has a set of tools available for both pairwise¹ and multiple² sequence alignments. In the example in Figure 3, we perform a global multiple sequence alignment of the class A scavenger receptor protein sequences from Homo sapiens using the ClustalW2 tool (Figure 3A). ClustalW2 was chosen because it



¹ http://www.ebi.ac.uk/Tools/psa
² http://www.ebi.ac.uk/Tools/msa



is suitable for "medium-length" alignments, which suits analysis of the scavenger receptors, which are approximately 500 amino acids in length. Additionally, ClustalW2 produces a colorful output, which makes it easy to spot conserved residues and patterns of charge or residue repeats by visual inspection. A portion of the results of this alignment is shown in Figure 3B. Notably, this alignment identified an area of conservation at the C-terminal region of the collagenous domain across all five members of the class A scavenger receptors (Figure 3C). This area, consisting of predominantly charged amino acids, has been previously implicated in ligand binding in SRAI (25). Consequently, we might predict that this region is a ligand-binding site not only in SRAI, but also in the other four members of this protein family.
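
Once a multiple sequence alignment is in hand, the visual search for conserved columns described above can also be done numerically. The sketch below scores each column by the fraction of sequences that share its most common residue. The toy alignment is invented for illustration and is not the SCARA alignment from Figure 3; the function name `column_conservation` is likewise an assumption.

```python
from collections import Counter
from typing import List


def column_conservation(alignment: List[str]) -> List[float]:
    """Per-column fraction of sequences sharing the most common residue."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # gaps never count as matches
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Invented three-sequence alignment; "-" marks a gap.
toy_alignment = ["GPPGKR", "GPAGKR", "G-PGER"]
scores = column_conservation(toy_alignment)
fully_conserved = [i + 1 for i, s in enumerate(scores) if s == 1.0]
print(fully_conserved)
```

Columns that score 1.0 are identical in every sequence; stretches of such columns are the candidates for functionally important regions.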

Another approach to the identification of conserved motifs, especially useful when no known homologs exist, is to use specialized tools that examine an input sequence for known domains. An example of such a tool is NCBI's Conserved Domain Search (CD-search), which compares a user-provided sequence against an NCBI-curated database of known domains (26). These tools do not capture the fine detail that sequence alignments provide but can nevertheless be very informative.

STRUCTURAL ANALYSIS 

ACQUIRING PUBLICLY AVAILABLE MACROMOLECULAR STRUCTURES

While clues to a protein's function can be hidden within its sequence, it is ultimately the protein's structure that dictates its function. Because DNA and protein sequencing are now routine, far more sequence data than structural data are available; however, databases





[Figure 3 image: (A) the EMBL-EBI ClustalW2 Multiple Sequence Alignment input form containing the class A scavenger receptor protein sequences; (B) a portion of the resulting alignment of the five sequences (NP_619729.1, NP_776194.2, NP_057324.2, NP_569057.1, and NP_006761.1).]



FIGURE 3 | Use of multiple sequence alignments to discover regions of evolutionary conservation and presumed functionality. FASTA-formatted protein sequences of the scavenger receptors were obtained as described previously for SRAI (NP_619729.1), MARCO (NP_006761.1), SCARA3 (NP_057324.2), SCARA4 (NP_569057.1), and SCARA5 (NP_776194.2) and input into the multiple sequence alignment tool ClustalW2 (A). The sequences were aligned; a portion of the alignment with the highest conservation across all five sequences is shown (B). The user may choose to view colored output, where red represents small, hydrophobic amino acids (AVFPMILW), blue represents acidic amino acids (DE), magenta represents basic amino acids (RK), and green represents STYHCNGQ (hydroxyl, sulfhydryl, amine, and glycine). Coloring allows the viewer to visualize the distribution of charge and hydrophobicity in the protein. In this example, we see that there is an orderly distribution of hydrophobic amino acids (red). The degree of consensus is represented with symbols: (*) indicates positions that have a single, fully conserved residue; (:) indicates conservation between groups with strongly similar properties; (.) indicates conservation between groups of amino acids with weakly similar properties. The fact that all five members of this family share this highly conserved region, at the locations indicated with pink rectangles (C), and that it is the area of highest conservation within the proteins, is strongly suggestive of a conserved function.

with structural information are available. The Protein Data Bank (PDB) is a worldwide collection of macromolecular structures governed by the Research Collaboratory for Structural Bioinformatics (RCSB). This online, searchable database (http://www.pdb.org) has come a long way from its meager beginnings as a repository established in 1971 for seven structures, as it is now home to 92,104 structures and counting (27). Each experimentally validated entry is assigned a PDB identifier that can be used to search the database. Alternatively, information such as the molecule name or author may be used.

A quick search of PDB with the search term "SCARA3" resulted in no hits. This is unsurprising given that little work has been done with this protein. However, since we know from our sequence analyses that there are regions of homology between SCARA3 and the other receptors, it is worth searching for these proteins as well. A search for "MARCO" revealed a structure (PDB ID: 2OY3) of the SRCR domain of the mouse MARCO protein (Figure 4). The PDB entry for this structure includes information such as the citation to the original publication, the functional classification of this region, its molecular weight, and an exportable macromolecular structure. Structures can be downloaded in a variety of formats, including as coded text saved as a .pdb file or as a static .jpg image. The .pdb file gives the user a chance to interact with the structure by moving it along an axis, coloring it based on amino acid type, or calculating potential protein-ligand interaction partners. These types of manipulations can be implemented in freely available software such as UCSF's Chimera (28) or others summarized in Table 3.

Unfortunately for our explorations of SCARA3, our previous sequence analyses indicate that the SRCR domain of MARCO, the only current macromolecular structure of a scavenger receptor, is not a region that is shared between these two receptors and, thus, it does not provide any new information about our protein of interest. As structural prediction technologies improve, and more experiments are conducted, the size of PDB will grow, but even in its current state it is an excellent resource for structural information.
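To make the idea of a .pdb file as "coded text" more concrete, the following is a minimal Python sketch that reads atom records from PDB-formatted text using the fixed column positions of the PDB file format. The function names and the three toy records are hypothetical and for illustration only; the coordinates are invented and are not taken from entry 2OY3.

```python
def parse_atom_line(line):
    """Parse one ATOM record using the fixed-width columns of the PDB format."""
    return {
        "atom": line[12:16].strip(),        # atom name, columns 13-16
        "residue": line[17:20].strip(),     # residue name, columns 18-20
        "chain": line[21],                  # chain identifier, column 22
        "residue_number": int(line[22:26]), # residue sequence number, columns 23-26
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


def residues_per_chain(pdb_text):
    """Count distinct residues in each chain from the ATOM records."""
    seen = set()
    counts = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        record = parse_atom_line(line)
        key = (record["chain"], record["residue_number"])
        if key not in seen:
            seen.add(key)
            counts[record["chain"]] = counts.get(record["chain"], 0) + 1
    return counts


# Toy records with invented coordinates (not real structural data).
TOY_PDB = "\n".join([
    "ATOM      1  N   MET A   1      27.340  24.430   2.614",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842",
    "ATOM      3  N   GLY A   2      24.100  26.500   3.100",
])

if __name__ == "__main__":
    print(parse_atom_line(TOY_PDB.splitlines()[1]))
    print(residues_per_chain(TOY_PDB))
```

Viewers such as UCSF Chimera read the same records to draw and color the structure; a short script like this is only useful for quick sanity checks, such as confirming how many residues a downloaded file actually contains.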






[Figure 4 (screenshot): the PDB summary page for entry 2OY3, "Crystal structure analysis of the monomeric SRCR domain of mouse MARCO" (DOI: 10.2210/pdb2oy3/pdb). The page lists the primary citation (Ojala et al., 2007, J. Biol. Chem. 282:16654-16666; PubMed 17405873; DOI 10.1074/jbc.M701750200), the molecular description (classification: ligand binding protein; one protein chain, A, of length 102; fragment: C-terminal scavenger receptor cysteine-rich (SRCR) domain; organism: Mus musculus; gene name: Marco; UniProtKB: Q60754), and a thumbnail of the biological assembly with links to 3D viewers.]



FIGURE 4 | The Protein Data Bank (PDB) entry for a macromolecular structure of a scavenger receptor. Because crystal structures of proteins are more difficult to obtain than their protein sequences, the PDB database is less populated than sequence databases such as NCBI's Entrez. However, PDB is still an excellent resource. Here, an example of the detailed entry for PDB ID 2OY3 is displayed after a search for "MARCO" was performed. The entry displays information such as the primary citation from which this structure was submitted and a small visualization of the structure. More detailed visualizations can be created easily by downloading the .pdb formatted file from the top right of an entry and displaying it in software such as UCSF Chimera.

PROTEIN STRUCTURAL PREDICTIONS

Even if an experimentally verified protein structure such as those in PDB does not exist for a protein of interest, predictions as to the potential secondary structure of a protein can still be made based on the primary protein sequence. One common method relies on identifying motifs in a protein sequence of interest that are similar to those of a well-studied protein with known function (29). However, this method risks transferring incorrectly annotated information from protein to protein, potentially corrupting genome databases if perpetuated (30). Other methods are based on highly complex algorithmic analyses, which make simplifying assumptions that trade some accuracy for a tractable algorithmic solution (31). These algorithms take into account certain patterns characteristic of a secondary structure, which tend to be represented in the primary sequence. For example, collagen, the main constituent of connective tissue, is generally encoded as a combination of glycine, proline, hydroxyproline, and hydroxylysine (32). These patterns allow bioinformatic tools to predict certain secondary structures such as collagenous regions from a primary sequence.
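As an illustration of how such a pattern can be read from a primary sequence, the sketch below scans for runs of Gly-X-Y triplets (a glycine at every third position), the repeat commonly associated with collagenous regions. The Gly-X-Y rule, the function name, and the minimum run length are assumptions made for this example rather than the method of any tool discussed here, and the toy fragment is merely modeled on the glycine-rich stretch of a class A scavenger receptor, so its exact residues should be treated as illustrative.

```python
import re


def find_gly_x_y_runs(sequence, min_repeats=5):
    """Return (start, end, n_triplets) for each run of consecutive Gly-X-Y triplets.

    start and end are 0-based, end-exclusive indices into the sequence.
    """
    pattern = re.compile(r"(?:G[A-Z]{2}){%d,}" % min_repeats)
    runs = []
    for match in pattern.finditer(sequence.upper()):
        n_triplets = (match.end() - match.start()) // 3
        runs.append((match.start(), match.end(), n_triplets))
    return runs


# Illustrative fragment only; not a verified database sequence.
TOY_FRAGMENT = (
    "KGDRGPTGESGPRGFPGPIGPPGLKGDRGAIGFPGSRGLPGYAGRPGNSGPKGQKGEKGSGNTLTPFTKV"
)

if __name__ == "__main__":
    for start, end, n in find_gly_x_y_runs(TOY_FRAGMENT):
        print(f"Gly-X-Y run at {start}-{end} ({n} triplets): {TOY_FRAGMENT[start:end]}")
```

A regular expression is enough here because the repeat has a fixed period; dedicated prediction servers combine many such signals with statistical models rather than relying on a single motif.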

Psipred is an excellent example of such a predictive tool. Psipred is an online resource that combines multiple secondary structure prediction methods into one easy-to-use web interface (33). First, Psipred generates a sequence profile of the user's sequence using PSI-BLAST, which determines areas of conservation and variation (33). Conserved areas denote areas of functionality, as well as areas that form the core of the protein, whereas variable regions, which are not responsible for specific folds or the integrity of the protein structure, generally exist on the surface (33). These sequence profiles give this tool its first hints as to the protein's structure. Subsequently, an algorithmic approach is used to compare the patterns found in the sequence of interest to those identified in other proteins.

The results of inputting the human SCARA3 protein sequence into the online Psipred tool gave us an indication of which segments of the sequence formed α-helices and β-sheets (Figure 5).
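The notion of a sequence profile separating conserved from variable positions can be sketched with a per-column Shannon entropy over a small alignment: an entropy of 0 bits marks a fully conserved column, and higher values mark variable columns. This is a simplified stand-in chosen for illustration, not the profile that Psipred itself builds, and the four aligned toy sequences are invented.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conservation_profile(alignment):
    """Return one entropy value per column for equal-length aligned sequences."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy(column) for column in zip(*alignment)]


# Invented toy alignment: columns 1, 2, and 4 are conserved; 3 and 5 vary.
TOY_ALIGNMENT = ["GPPGA", "GPAGC", "GPKGD", "GPRGE"]

if __name__ == "__main__":
    for position, entropy in enumerate(conservation_profile(TOY_ALIGNMENT), start=1):
        label = "conserved" if entropy == 0 else "variable"
        print(f"column {position}: {entropy:.2f} bits ({label})")
```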



Table 3 | Summary of publicly available software for the modeling of macromolecular structures.

Name | Hosted by | URL | Features | Availability | Reference
UCSF Chimera | Resource for Biocomputing, Visualization, and Informatics at University of California, San Francisco, CA, USA | http://www.cgl.ucsf.edu/chimera/ | Allows interactive visualization of macromolecular structures. Along with .pdb files, one can also import density maps, sequence alignments, and trajectories among other information. Python script plugins | For download on all major platforms | Pettersen et al. (28)
BioBlender | Science Visualization Unit, Consiglio Nazionale delle Ricerche (CNR) | http://bioblender.eu | Built as an extension of Blender, open-source 3D modeling software used for video games and animation; able to display physical and chemical properties of a protein | For download on all major platforms | Andrei et al. (64)
Jmol | Various | http://jmol.sourceforge.net | Visualization of 3D protein structures in a variety of input formats including .pdb; can measure distances in Å. Great introductory animation at URL | Web applet | (65)

These software packages can import .pdb formatted files for viewing and/or manipulation and modeling.



[Figure 5 (screenshot): the PSIPRED Protein Sequence Analysis Workbench, hosted by the Bioinformatics Group of the UCL Department of Computer Science, with PSIPRED v3.3 (Predict Secondary Structure) selected and the SCARA3 protein sequence (NP_057324.2, scavenger receptor class A member 3 isoform 1) pasted into the input form, followed by the graphical output showing per-residue confidence (Conf) and prediction (Pred) tracks.]



FIGURE 5 | The use of Psipred for the prediction of the secondary protein structure of SCARA3. The Psipred tool combines various secondary structure prediction algorithms into one web interface. Upon inputting the NCBI RefSeq protein sequence of SCARA3, Psipred output structural predictions, including the location of α-helices (pink cylinders) and β-sheets (yellow arrows).






When we analyzed the protein sequences of all the scavenger receptors as part of our determination of the evolution of the protein family (24), we were able to build on this information to discover that some of the predicted α-helix segments were in fact coiled-coil motifs of the form HxxHcccH, in which hydrophobic (H) residues are interspersed with other amino acids (x), some of which are more likely to be charged (c) (34, 35). A few other tools work in a similar fashion to Psipred; these are reviewed in Table 4.
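The HxxHcccH form described above can be checked mechanically with a sliding window. The following Python sketch reports every eight-residue window whose first, fourth, and eighth positions are hydrophobic and counts how many of the three "c" positions are charged. The residue groupings reuse the hydrophobic (AVFPMILW), acidic (DE), and basic (RK) classes from the Figure 3 caption, which is an assumption made for this example; the function name and toy peptide are hypothetical.

```python
HYDROPHOBIC = set("AVFPMILW")  # "small, hydrophobic" group from the Figure 3 caption
CHARGED = set("DERK")          # acidic (DE) plus basic (RK) residues


def find_hxxhccch(sequence):
    """Return (start, n_charged) for each 8-residue window matching H-x-x-H-c-c-c-H.

    A window matches when positions 1, 4, and 8 are hydrophobic; n_charged is
    the number of charged residues among positions 5 to 7.
    """
    seq = sequence.upper()
    hits = []
    for i in range(len(seq) - 7):
        window = seq[i:i + 8]
        if all(window[k] in HYDROPHOBIC for k in (0, 3, 7)):
            n_charged = sum(1 for residue in window[4:7] if residue in CHARGED)
            hits.append((i, n_charged))
    return hits


if __name__ == "__main__":
    # Invented peptide: L at positions 1, 4, and 8, with E, D, K in between.
    print(find_hxxhccch("LKELEDKL"))
```

Runs of overlapping hits spaced seven residues apart are more suggestive of a coiled coil than an isolated match, so in practice the hit list would be inspected for such periodicity.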

In addition to these general tools, there are others that focus on predicting specific aspects of different types of proteins. The TMHMM Server, for example, focuses on the prediction of transmembrane domains using a statistical model (36). Output from this tool indicates whether a protein has a transmembrane domain and its predicted location. Additionally, tools such as SignalP focus on the prediction of signal peptide cleavage sites within an amino acid sequence, which can add to the user's knowledge of a protein's structure (37).
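For intuition about why transmembrane segments are detectable from sequence alone, the sketch below computes a sliding-window hydropathy profile with the Kyte-Doolittle scale. This is a deliberately simpler, older technique than the hidden Markov model used by TMHMM and is swapped in here purely for illustration; the 19-residue window and the 1.6 threshold are the conventional Kyte-Doolittle settings, and the toy sequence is invented.

```python
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Mean Kyte-Doolittle hydropathy for every window of the given length."""
    seq = sequence.upper()
    return [
        sum(KYTE_DOOLITTLE[residue] for residue in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """Return (start, mean_hydropathy) for windows at or above the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [(i, score) for i, score in enumerate(profile) if score >= threshold]


if __name__ == "__main__":
    # Invented sequence: a 19-residue leucine stretch flanked by charged residues.
    toy = "DDDDD" + "L" * 19 + "KKKKK"
    profile = hydropathy_profile(toy)
    best = max(range(len(profile)), key=lambda i: profile[i])
    print(f"most hydrophobic window starts at index {best}, mean {profile[best]:.2f}")
```

A hydropathy scan tends to flag any long hydrophobic stretch, including signal peptides, which is one reason model-based tools such as TMHMM and SignalP are preferred for real predictions.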

TRANSCRIPTOMICS 

GENE EXPRESSION PROFILES TO ANSWER IMMUNOLOGICAL 
QUESTIONS 

Studies of global gene expression ("transcriptomics") using microarrays, RNA sequencing (RNAseq), and other platforms have been a valuable tool for immunologists. Transcriptomics can be used to discover "gene signatures" of disease states or to provide mechanistic insight into disease etiology. Because variability within individuals dictates symptoms and disease progression, it is very rare that changes in expression of a single gene will be sufficiently robust for diagnosis; however, combinatorial changes that indicate a common mode of regulation are more robust and allow for the formation of "gene signatures." For example, an "interferon signature" of gene expression was discovered in lupus when type I interferon inducible genes were found to be elevated in the peripheral blood mononuclear cells (PBMCs) of patients with lupus compared to healthy controls (38). Other notable discoveries in immunology made using transcriptomics include the discovery of the mechanisms of genetic regulation associated with lipopolysaccharide (LPS) tolerance (39), the prediction of long-term survival from breast and other cancers (40), and the study of changes in microbial gene expression over the course of disease (41). As the immunology community's use of transcriptomic data increases, public repositories such as the NCBI's Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/), EBI's Gene Expression Atlas (https://www.ebi.ac.uk/gxa/), and other specialized sites such as http://www.macrophages.com/ contain a rich amount of data waiting to be mined. These resources include transcriptional profiles of different immunological cell types and activation states in a wide range of organisms. Although there are challenges with comparing microarray data from different platforms and sources (42), the cost savings of reusing rather than reproducing publicly available experiments have increased the appeal of utilizing public resources.
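The reasoning behind a gene signature, that a coordinated change across several genes is more robust than a change in any single gene, can be sketched as a simple per-sample score: the mean log2 fold change of the signature genes relative to control means. The scoring rule, the gene names, and all expression values below are hypothetical and chosen only for illustration; published signatures use more careful normalization and statistics.

```python
import math


def signature_score(sample, signature_genes, control_means):
    """Mean log2 fold change of the signature genes relative to control means."""
    log_ratios = [
        math.log2(sample[gene] / control_means[gene]) for gene in signature_genes
    ]
    return sum(log_ratios) / len(log_ratios)


# Hypothetical gene names and invented expression values (arbitrary units).
SIGNATURE = ["GENE_A", "GENE_B"]
CONTROL_MEANS = {"GENE_A": 1.0, "GENE_B": 1.0}
PATIENT = {"GENE_A": 4.0, "GENE_B": 2.0}
HEALTHY = {"GENE_A": 1.0, "GENE_B": 1.0}

if __name__ == "__main__":
    print("patient score:", signature_score(PATIENT, SIGNATURE, CONTROL_MEANS))
    print("healthy score:", signature_score(HEALTHY, SIGNATURE, CONTROL_MEANS))
```

Averaging over the signature dampens the influence of any one noisy gene, which is the practical advantage the text attributes to combinatorial changes.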

Transcriptomics has also fed the immunologist's obsession with characterizing leukocyte subsets and lineage. In some cases, defining cells by their transcriptional profile has proven to be as effective as sorting by flow cytometry (42). These data have inspired researchers to search for the holy grail of transcriptional profiles that characterize subsets of immune cells and are more specific than surface markers. Although this approach has been somewhat successful [e.g., in identifying a novel subset of NK cells (43)], for cell types such as macrophages and dendritic cells, which seem to have a more plastic phenotype and ontogeny, the usefulness of this approach has been a subject of debate (44, 45). Nonetheless, this quest has inspired the creation of the Immunological Genome Project (www.immgen.org) (46). This consortium of researchers is characterizing the transcriptional profile of immune cells based on rigid sorting and purification profiles, and although these data consist almost entirely of mouse genes in the steady state, it is a valuable resource to the immunology community. In our attempt to learn about SCARA3, we used the "Gene Skyline" and "Modules and Regulators" tools (Figure 6A) to find that transcripts of SCARA3 are expressed broadly across a wide range of cells at relatively low abundance (Figure 6B). There are no published data describing how SCARA3 is transcriptionally regulated; however, four transcription factor binding sites (NFIA, TAL1, KLF4, and LMO2) and two regulatory regions are predicted to occur in the promoter region of SCARA3 (Figure 6C). The Immgen database allows researchers to glean a considerable amount of data about their gene of interest with very little investment or specialized knowledge.

Although the Immgen database is probably the most user 
friendly, it is dominated by mouse immune cell subsets. Other 



http://www.ncbi.nlm.nih.gov/geo/
https://www.ebi.ac.uk/gxa/
www.immgen.org



Table 4 | Tools for the prediction of secondary structure characteristics.

Name | Hosted by | URL | Features | Reference
psipred | University College London (UCL) Department of Computer Science | http://bioinf.cs.ucl.ac.uk/psipred/ | Uses PSI-BLAST to determine regions of homology that inform its predictions | Jones (33)
JPred | University of Dundee | http://www.compbio.dundee.ac.uk/www-jpred | Takes into account solvent accessibility in its predictions; displays PDB matches if applicable | Cole et al. (66)
CFSSP (Chou and Fasman Secondary Structure Prediction) Server | BioGem.org | http://biogem.org/tool/chou-fasman/ | Uses the Chou and Fasman algorithm to predict helices, sheets, turns, and coils | Chou and Fasman (67)




[Figure 6 (screenshots): (A) the Immunological Genome Project data browsers page, including the Differential Splicing, Modules and Regulators, RNA-seq, and Human/Mouse Comparison browsers. (B) The Gene Skyline entry for Scara3 (probe set 10420691; title "scavenger receptor class A, member 3"; chromosome 14; NCBI gene 219151). (C) The "Sequence motifs enriched regulators" table, listing GAGA.01 (GAGA-Box), AG_rich_coding, GATA1.06 (LMO2|TAL1|TCF3|GATA1; complex of Lmo2 bound to Tal-1, E2A proteins, and GATA-1, half-site 2), GKLF.01 (KLF4; gut-enriched Krueppel-like factor), NF1.01 (NFIA|NFIB|NFIC|NFIX; nuclear factor 1), and TAL1-TCF3 (TAL1|TCF3; MA0091).]

FIGURE 6 | Querying the Immunological Genome Project (http://immgen.org) for data on expression and transcriptional regulation of SCARA3. (A) The Immunological Genome Project has a number of ways to browse the data and visualize patterns of gene expression and transcriptional regulation. (B) Using the "Gene Skyline" browser we see that the transcript for SCARA3 is expressed at low levels in most cell types in the database. (C) Using the "Modules and Regulators" browser we see that there are four predicted transcription factor binding sites (NF1.01, GATA1.06, GKLF.01, and TAL1-TCF3) and two regulatory regions (GAGA.01, AG_rich_coding) in the promoter of SCARA3.






resources such as IRIS (Immune response in silico) take a similar 
approach to characterizing the transcriptional profiles of human 
leukocyte subsets and include different activation states (47). 

GENETIC VARIATION 

ANALYSIS OF SINGLE-NUCLEOTIDE POLYMORPHISM 

The most common type of variation within the human genome is the single-nucleotide polymorphism (SNP); SNPs occur, on average, every 1200 base pairs (48). SNPs can be non-synonymous or synonymous: non-synonymous SNPs result in a change in the amino acid sequence of the translated protein, while synonymous SNPs do not alter the amino acid sequence because of the redundancy of the genetic code.
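The distinction between synonymous and non-synonymous SNPs can be worked through with the standard genetic code. The Python sketch below maps a coding-sequence position to its codon and classifies a single-base change. As a worked case it uses a T>C change at coding position 1399 that converts Phe467 to Leu, the pattern reported by dbSNP for the SCARA3 variant rs17057523; the reference codon TTT is an assumption (TTC would give the same result), and the function names are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, built from the conventional TCAG-ordered table.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_position(cds_position):
    """Map a 1-based coding position to (codon number, 0-based offset in the codon)."""
    return (cds_position - 1) // 3 + 1, (cds_position - 1) % 3


def classify_snp(ref_codon, offset, alt_base):
    """Return (kind, reference amino acid, variant amino acid) for a one-base change."""
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


if __name__ == "__main__":
    codon_number, offset = codon_position(1399)
    print(codon_number, offset)
    # Assumed reference codon TTT (Phe); first base changed to C gives CTT (Leu).
    print(classify_snp("TTT", offset, "C"))
    # The same codon with a third-base change stays Phe: a synonymous SNP.
    print(classify_snp("TTT", 2, "C"))
```

The third-position example shows the redundancy of the genetic code directly: TTT and TTC both encode phenylalanine, so that change leaves the protein unaltered.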

Single-nucleotide polymorphism analysis of a protein can greatly aid in the understanding of its function, as these small alterations can result in substantial changes in the functionality of the protein. For example, a SNP at a receptor's binding site may alter the protein such that it can bind a pathogen that it previously could not or, in contrast, may abolish its ability to bind its usual binding partner. In one study, researchers examined differences in SNP frequencies of Mal/TIRAP to explain differences in TLR2 and TLR4 signaling between European and African populations (49). After the two variants, S180L and S180, were cloned, the results indicated that S180L heterozygous individuals had a higher cytokine production level than S180 homozygous individuals (49). Lower allele frequencies of S180L in African and Asian populations might indicate that selection occurred after humans migrated from Africa, since the variant may have granted added bacterial resistance in the changing habitat (49). This study demonstrates how SNP analyses can be used to identify functional domains of a protein as well as to uncover a protein's potential evolutionary history.
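Comparisons of allele frequencies between populations, such as the one described above, rest on a simple calculation from genotype counts. A minimal sketch follows; the function name is hypothetical and the genotype counts are invented for illustration, not taken from the cited study.

```python
def allele_frequencies(n_hom_ref, n_het, n_hom_alt):
    """Return (reference, variant) allele frequencies from diploid genotype counts."""
    total_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    if total_alleles == 0:
        raise ValueError("at least one genotyped individual is required")
    variant = (n_het + 2 * n_hom_alt) / total_alleles
    return 1.0 - variant, variant


if __name__ == "__main__":
    # Invented counts: 81 reference homozygotes, 18 heterozygotes,
    # and 1 variant homozygote in a sample of 100 individuals.
    ref_freq, var_freq = allele_frequencies(81, 18, 1)
    print(f"reference allele: {ref_freq:.3f}, variant allele: {var_freq:.3f}")
```

Each heterozygote contributes one copy of the variant allele and each variant homozygote contributes two, which is why the denominator is twice the number of individuals.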

There are several publicly available online databases for the analysis of SNPs in a protein of interest (summarized in Table 5); here, we use the University of California, Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu) to perform an analysis of SNPs present within SCARA3. Regions of interest can be found by entering the name of a gene or its corresponding chromosomal position. The Genome Browser contains multiple "tracks" that contain different types of annotation, including those based on NCBI RefSeqs, mRNA alignments, and UCSC Genes (50) (Figure 7). In addition, the browser can display reports regarding gene expression, regulation, and variation, among other information (50). The UCSC Genome Browser includes an annotated SNP track with over 23 million reference SNPs from NCBI's SNP Database (dbSNP) (50)



http://genome.ucsc.edu



Table 5 | Publicly available single-nucleotide polymorphism (SNP) databases.

Name | Hosted by | URL | Features | Availability | Reference
UCSC | University of California, Santa Cruz, CA, USA | http://genome.ucsc.edu/ | Integrated browser displaying tracks built from annotation sets including SNPs, mRNA, disease association studies, and more | Web applet | Kent (68)
dbSNP | National Center for Biotechnology Information | http://ncbi.nlm.nih.gov/SNP/ | Central database of SNPs with integrated data from multiple population studies including the 1000 Genomes Project | Web applet | Sherry et al. (48)
GWAS Central (formerly HGVbase database) | Institutes, consortia, and individual laboratories | http://gwascentral.org/ | Database of human genetic variation. Displays information on phenotypes, genes, regions, or markers based on SNPs | Web applet | Fredman et al. (69)
ENSEMBL | European Bioinformatics Institute (EBI) | http://ensembl.org/ | Contains available genomes of multiple species. Displays summary information regarding isoforms, SNPs, and other features of genes or proteins | Web applet | Flicek et al. (70)
HapMap | National Center for Biotechnology Information | http://hapmap.ncbi.nlm.nih.gov/ | Contains integrated data of SNPs for haplotype analysis, finding tag SNPs, and for identifying GWAS hits | Web applet | Gibbs et al. (71)
1000 Genomes Project | European Bioinformatics Institute | http://1000genomes.org | Contains 1092 available human genomes for analysis as well as summary documentation regarding SNPs and other variation | FTP download | Abecasis et al. (72)
HaploView | The Broad Institute | http://broadinstitute.org/ | Calculates r² and D' values for performing haplotype analysis of SNPs with HapMap data or user input data | For download on all major platforms | Barrett et al. (73)

This list includes only SNP databases that focus on human and/or mouse sequences; other, more specialized databases may exist for other organisms. All databases listed accept novel SNPs from private and public organizations.






[Figure 7 (screenshots): (A) the UCSC Genome Browser on the Human Feb. 2009 (GRCh37/hg19) assembly, showing a 38,961 bp window (chr8:27,491,577-27,530,537) around SCARA3, with the track controls for the Comparative Genomics and Variation and Repeats groups listed below the graphical output. (B) The same window with the "Common SNPs(137)" track set to "pack".]



FIGURE 7 | Using the UCSC Genome Browser to search for single-nucleotide polymorphisms (SNPs) in SCARA3. This browser contains multiple "tracks," including the location of SNPs across the length of a protein. Here we show the output from inputting the NCBI RefSeq for SCARA3 isoform (A). Further options to hide or show more annotation tracks are available directly below the graphical output. Under the "Variation and Repeats" tab, selecting "pack" under the "Common SNPs" option updates the output to include a full display of SNPs represented by their refSNP cluster ID numbers [(B), circled]. Clicking on any of the refSNP cluster IDs leads to a page displaying further information regarding the SNP as well as a link to NCBI's dbSNP database.
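Searching the Genome Browser by chromosomal position relies on coordinate strings of the form chromosome:start-end. The sketch below parses such a string, computes the window length, and tests whether a SNP position falls inside it. It uses the window shown in Figure 7 (38,961 bp on the GRCh37/hg19 assembly) and the genomic position 27,528,446 reported in dbSNP for rs17057523; the chromosome name "chr8" is inferred from the NC_000008 accession in that record, and the function names are hypothetical.

```python
import re


def parse_region(region):
    """Parse 'chr8:27,491,577-27,530,537' into (chromosome, start, end)."""
    match = re.fullmatch(r"(chr\w+):([\d,]+)-([\d,]+)", region.replace(" ", ""))
    if match is None:
        raise ValueError(f"unrecognized region string: {region!r}")
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    return match.group(1), start, end


def region_length(start, end):
    """Length in base pairs of a region with inclusive 1-based coordinates."""
    return end - start + 1


def region_contains(region, chromosome, position):
    """True if the position lies on the same chromosome within the region."""
    region_chromosome, start, end = parse_region(region)
    return chromosome == region_chromosome and start <= position <= end


if __name__ == "__main__":
    window = "chr8:27,491,577-27,530,537"
    chromosome, start, end = parse_region(window)
    print(chromosome, region_length(start, end), "bp")
    print(region_contains(window, "chr8", 27528446))
```

Coordinates are only comparable within one genome assembly, so positions taken from dbSNP should be checked against the same build (here GRCh37/hg19) before being compared with a browser window.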



(Figure 7B). SNPs are annotated using a refSNP cluster ID number (rs#), which represents all SNPs, often from multiple population studies, that map to the same location in the gene. Additionally, each individual SNP within a cluster is associated with a SNP accession number (ss#) (48). Selecting a refSNP cluster within the Genome Browser will display information such as the nucleotide






[Figure 8 screenshot: NCBI dbSNP "Reference SNP (refSNP) Cluster Report: 
rs17057523". Panel (A) lists the organism (human, Homo sapiens), molecule 
type (genomic), RefSNP alleles (C/T), ancestral allele (C), MAF (0.120; 
source: 1000 Genomes), and HGVS names including NC_000008.10:g.27528446T>C, 
NM_016240.2:c.1399T>C, and NP_057324.2:p.Phe467Leu. Panel (B) is the 
"Population Diversity" table of genotype and allele frequencies for each 
sampled population.] 



FIGURE 8 | Example results page from the NCBI dbSNP database for 
SCARA3 SNP rs17057523. By following the link from the UCSC Genome 
Browser to dbSNP, more information is provided for SNP rs17057523 
including allele frequencies, ancestral alleles, and chromosomal position 
(A). Following this information on the database website are other tabs that 
show more information regarding the SNP that may be useful to investigators. 
The "Population Diversity" section displays information regarding allele 
frequencies from different sampled populations (B). Clicking any of the 
population links shows information on how the SNP was genotyped, the 
population sample size, and other experimental conditions used. 






change, chromosomal position, and type of variant as well as a 
link to the dbSNP database (Figure 8), which contains further 
detail on the population studies associated with the SNP, including 
observed allele frequencies and links to other resources such 
as GenBank and PubMed (48). The dbSNP database can also be 
accessed externally through NCBI, and individual SNPs can be 
searched for using their SNP Accession number, population study 
name, or via a BLAST search (51). 
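
The HGVS names that dbSNP reports for rs17057523 in Figure 8 (for example 
NM_016240.2:c.1399T>C and NP_057324.2:p.Phe467Leu) pack the reference 
sequence, coordinate type, position, and change into a single string. As a 
minimal sketch, assuming only simple substitutions need to be handled, the 
Python below splits such names into their parts. The function name 
`parse_hgvs_substitution` and the regular expressions are our own 
illustration, not part of any NCBI tool, and they do not cover the full HGVS 
nomenclature (intronic offsets, insertions, deletions, and so on). 

```python
import re

# Illustrative patterns (assumptions, not an official HGVS grammar).
# Nucleotide substitution, e.g. "NM_016240.2:c.1399T>C"
_NUCLEOTIDE = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+\.\d+):(?P<kind>[gc])\."
    r"(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)
# Protein substitution, e.g. "NP_057324.2:p.Phe467Leu"
_PROTEIN = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+\.\d+):(?P<kind>p)\."
    r"(?P<ref>[A-Z][a-z]{2})(?P<position>\d+)(?P<alt>[A-Z][a-z]{2})$"
)


def parse_hgvs_substitution(name):
    """Split a simple HGVS substitution into its fields.

    Returns a dict with keys accession, kind, position, ref, and alt,
    or None when the name is not a simple substitution.
    """
    for pattern in (_NUCLEOTIDE, _PROTEIN):
        match = pattern.match(name)
        if match:
            fields = match.groupdict()
            fields["position"] = int(fields["position"])
            return fields
    return None


if __name__ == "__main__":
    for hgvs in (
        "NC_000008.10:g.27528446T>C",
        "NM_016240.2:c.1399T>C",
        "NP_057324.2:p.Phe467Leu",
    ):
        print(hgvs, "->", parse_hgvs_substitution(hgvs))
```

Read this way, the coding and protein names agree with each other: coding 
position 1399 is the first base of codon 467 (466 x 3 = 1398), which matches 
the reported change at residue 467. 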

When the UCSC Genome Browser is used to search for 
SCARA3, the resulting SNP track shows all of the reported SNPs 
within the gene (Figure 7B). Most of the annotated SNPs within 
SCARA3 are intronic variants, which would not alter the resultant 
protein; however, intronic regions have been shown to be 
involved in regulatory processes. Of the three SNPs found in the 
exons of SCARA3, rs17057523 has the highest global minor allele 
frequency, 0.120, based on the 1000 Genomes Project phase 1 
data. Following the external link to dbSNP's "Population Diversity" 
section shows that the SNP is found at higher frequencies in 
Asian populations, with allele frequencies up to 0.222, while other 
populations remain close to 0.1 (Figure 8). Additionally, the "Multiz 
Alignment" track shows areas of conservation between multiple 
vertebrates and suggests that SNP rs17057523 is present within a 
conserved area of SCARA3. Further testing by cloning the variant 
can help determine the function of this domain by examining 
functional differences between the SNP and wildtype allele. 
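
The "Population Diversity" section reports both genotype and allele 
frequencies, and the two are linked by simple arithmetic: each homozygote 
carries two copies of an allele and each heterozygote carries one. The 
Python sketch below makes that relationship explicit and also computes the 
genotype frequencies expected under Hardy-Weinberg equilibrium. The function 
names are ours, and the genotype frequencies in the example (0.023, 0.372, 
and 0.605 for C/C, C/T, and T/T) are illustrative values of the kind shown 
for a single population in Figure 8; they are an assumption for 
demonstration, not data to cite. 

```python
def allele_frequencies(f_cc, f_ct, f_tt, tolerance=0.005):
    """Return (freq_C, freq_T) from the three genotype frequencies.

    The tolerance allows for rounding in published tables, where
    frequencies are typically given to three decimal places.
    """
    total = f_cc + f_ct + f_tt
    if abs(total - 1.0) > tolerance:
        raise ValueError("genotype frequencies must sum to 1")
    # Homozygotes contribute fully, heterozygotes contribute half.
    freq_c = f_cc + 0.5 * f_ct
    freq_t = f_tt + 0.5 * f_ct
    return freq_c, freq_t


def hardy_weinberg_expected(freq_c):
    """Expected (C/C, C/T, T/T) genotype frequencies: p^2, 2pq, q^2."""
    freq_t = 1.0 - freq_c
    return freq_c ** 2, 2 * freq_c * freq_t, freq_t ** 2


if __name__ == "__main__":
    # Illustrative genotype frequencies (assumed values, see text).
    freq_c, freq_t = allele_frequencies(0.023, 0.372, 0.605)
    print("C allele: %.3f, T allele: %.3f" % (freq_c, freq_t))
    print("Hardy-Weinberg expectation:", hardy_weinberg_expected(freq_c))
```

For the illustrative values, the C allele frequency works out to 
0.023 + 0.372 / 2 = 0.209 and the T allele frequency to 0.791. Comparing 
observed genotype frequencies with the Hardy-Weinberg expectation is one 
quick check on whether a population sample looks unusual before investing 
in bench work. 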

FURTHER ANALYSES 

What has been covered here represents the basic knowledge upon 
which most bioinformatic analyses will be conducted. As in any 
field, there are a plethora of examples of highly specialized 
bioinformatic tools and software that have been developed for the 
various sub-fields of immunology. For example, HLA peptide 
binding predictions can be made using various tools such as that 
available from the National Institutes of Health^8 (52). While an 
exhaustive list of such programs cannot be given, we refer 
the reader to other, more specialized reviews of such tools 
(53-55). 

CONCLUDING REMARKS 

In our opinion, bioinformatics is a methodology that is under-utilized 
in immunological studies. Far from being inaccessible and 
complicated, many bioinformatic tools are straightforward and 
available via online servers, meaning that a researcher can obtain 
results instantaneously without fear of the often-steep learning 
curve associated with installable software. Although a strong 
background in computer science is an asset for more complicated 
techniques, in order to perform the analyses that we have described 
here, a passing familiarity with the cut and paste function is all 
that is required. If the reader is interested in going beyond this, 
there are excellent, freely available resources such as Software 
Carpentry^9, Rosalind^10, and online courses such as those available at 
Coursera^11 and edX^12. Acquiring vocabulary is probably the most 
challenging aspect of venturing into bioinformatics; however, one 
might argue that this is considerably easier to master than the 
language of immunology with its interminable number of 
interleukins, CD numbers, and signaling pathways. The goal of this 
review is to demonstrate some basic principles and techniques that 
are easily incorporated into the average bench scientist's research 
and to encourage immunologists and cell biologists to consider 
using in silico approaches to generate and test hypotheses and 
answer research questions. Of course, like all hypotheses, those 
generated with in silico approaches must be experimentally tested. 
Whether in silico approaches are more or less accurate than 
traditional methods of hypothesis generation has yet to be evaluated. 
Our inquiry into the properties of SCARA3 indicates that these 
tools are immensely useful in generating hypotheses that can then 
be tested bench-side. Although many researchers have decried the 
lack of trained bioinformaticians and bioinformaticists, perhaps 
the best way to overcome the current shortage may be for scientists 
to become conversant in some of the basic techniques of 
bioinformatics in much the same way that we must be knowledgeable 
about the statistical tools required to analyze and understand our 
research. 

^8 http://www-bimas.cit.nih.gov/index.shtml 
^9 http://software-carpentry.org/ 
^10 http://rosalind.info/ 
^11 http://www.coursera.org 
^12 http://www.edx.org 

ACKNOWLEDGMENTS 

This work was funded by a Natural Sciences and Engineering 
Research Council grant to Dawn M. E. Bowdish. Fiona Whelan was 
funded by an Ontario Graduate Scholarship (OGS). Additionally, 
work in the Bowdish laboratory is supported in part by the McMaster 
Immunology Research Centre (MIRC) and the Michael G. DeGroote 
Institute for Infectious Disease Research (IIDR). Nicholas 
Yap was funded by a Natural Sciences and Engineering Research 
Council Discovery grant to G. Brian Golding. Fiona J. Whelan and 
Dawn M. E. Bowdish conceived and designed this article. Fiona 
J. Whelan, Nicholas V. L. Yap, G. Brian Golding, and Dawn M. 
E. Bowdish drafted the manuscript. All authors read, edited, and 
approved the final manuscript. 

REFERENCES 

1. Hesper B, Hogeweg P. Bioinformatica: een werkconcept. Kameleon (1970) 
1:28-9. 

2. Moore WJ. Schrodinger: Life and Thought. Cambridge: Cambridge University 
Press (1992). 

3. Schrodinger E. What is Life?: The Physical Aspects of Living Cell with Mind and 
Matter and Autobiographical Sketches. Cambridge: Cambridge University Press 
(1967). 

4. Olson MV. The human genome project. Proc Natl Acad Sci USA (1993) 
90:4338-44. doi:10.1073/pnas.90.10.4338 

5. Wishart DS. DrugBank: a comprehensive resource for in silico drug discovery 
and exploration. Nucleic Acids Res (2006) 34:D668-72. doi:10.1093/nar/gkj067 

6. Roach J, Glusman G, Rowen L, Kaur A, Purcell M, Smith K, et al. The 
evolution of vertebrate Toll-like receptors. Proc Natl Acad Sci USA (2005) 
102:9577. doi:10.1073/pnas.0502272102 

7. Levasseur A, Pontarotti P. Was the ancestral MHC involved in innate immunity? 
Eur J Immunol (2010) 40:2682-5. doi:10.1002/eji.201040856 

8. Rapin N, Lund O, Bernaschi M, Castiglione F. Computational immunology 
meets bioinformatics: the use of prediction tools for molecular binding 
in the simulation of the immune system. PLoS One (2010) 5:e9862. 
doi:10.1371/journal.pone.0009862 

9. Seal JB, Alverdy JC, Zaborina O, An G. Agent-based dynamic knowledge 
representation of Pseudomonas aeruginosa virulence activation in the stressed gut: 
towards characterizing host-pathogen interactions in gut-derived sepsis. Theor 
Biol Med Model (2011) 8:33. doi:10.1186/1742-4682-8-33 






10. Chau TA, McCully ML, Brintnell W, An G, Kasper KJ, Vines ED, et al. Toll-like 
receptor 2 ligands on the staphylococcal cell wall downregulate 
superantigen-induced T cell activation and prevent toxic shock syndrome. 
Nat Med (2009) 15:641-8. doi:10.1038/nm.1965 

11. Bowdish D, Gordon S. Conserved domains of the class A scavenger receptors: 
evolution and function. Immunol Rev (2009) 227:19-31. 
doi:10.1111/j.1600-065X.2008.00728.x 

12. Pearson WR. Searching protein sequence libraries: comparison of the sensitivity 
and selectivity of the Smith-Waterman and FASTA algorithms. Genomics (1991) 
11:635-50. doi:10.1016/0888-7543(91)90071-L 

13. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank. 
Nucleic Acids Res (2008) 36:D25-30. doi:10.1093/nar/gkm929 

14. Schuler GD, Epstein JA, Ohkawa H, Kans JA. Entrez: molecular biology database 
and retrieval system. Methods Enzymol (1996) 266:141-62. 

15. Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered 
information at NCBI. Nucleic Acids Res (2007) 35:D26-31. doi:10.1093/nar/gkl993 

16. Pruitt KD, Maglott DR. RefSeq and LocusLink: NCBI gene-centered resources. 
Nucleic Acids Res (2001) 29:137-40. doi:10.1093/nar/29.1.137 

17. Cambier JC. Antigen and Fc receptor signaling. The awesome power of the 
immunoreceptor tyrosine-based activation motif (ITAM). J Immunol (1995) 
155:3281-5. 

18. Fong LG. Modulation of macrophage scavenger receptor transport by protein 
phosphorylation. J Lipid Res (1996) 37:574-87. 

19. Fong LG, Le D. The processing of ligands by the class A scavenger receptor 
is dependent on signal information located in the cytoplasmic domain. J Biol 
Chem (1999) 274:36808-16. doi:10.1074/jbc.274.51.36808 

20. Blom N, Gammeltoft S, Brunak S. Sequence and structure-based prediction of 
eukaryotic protein phosphorylation sites. J Mol Biol (1999) 294(5):1351-62. 
doi:10.1006/jmbi.1999.3310 

21. Kimura M. The Neutral Theory of Molecular Evolution. Cambridge: Cambridge 
University Press (1984). 

22. Barton NH. Evolution. New York: Cold Spring Harbor Laboratory Press (2007). 

23. Mount DW. Bioinformatics: Sequence and Genome Analysis. New York: Cold 
Spring Harbor Laboratory Press (2004). 

24. Whelan FJ, Meehan CJ, Golding GB, McConkey BJ, Bowdish DM. The 
evolution of the class A scavenger receptors. BMC Evol Biol (2012) 12:227. 
doi:10.1186/1471-2148-12-227 

25. Acton S, Resnick D, Freeman M, Ekkel Y, Ashkenas J, Krieger M. The 
collagenous domains of macrophage scavenger receptors and complement component 
C1q mediate their similar, but not identical, binding specificities for polyanionic 
ligands. J Biol Chem (1993) 268:3530. 

26. Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott 
C, et al. CDD: a Conserved Domain Database for the functional annotation of 
proteins. Nucleic Acids Res (2010) 39:D225-9. doi:10.1093/nar/gkq1189 

27. Westbrook J. The protein data bank and structural genomics. Nucleic Acids Res 
(2003) 31:489-91. doi:10.1093/nar/gkg068 

28. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, 
et al. UCSF chimera: a visualization system for exploratory research and analysis. 
J Comput Chem (2004) 25:1605-12. doi:10.1002/jcc.20084 

29. Whisstock J, Lesk A. Prediction of protein function from protein sequence and 
structure. Q Rev Biophys (2003) 36:307-40. doi:10.1017/S0033583503003901 

30. Hegyi H, Gerstein M. The relationship between protein structure and function: a 
comprehensive survey with application to the yeast genome. J Mol Biol (1999) 
288:147-64. doi:10.1006/jmbi.1999.2661 

31. Burkowski FJ. Structural Bioinformatics: An Algorithmic Approach. Boca Raton, 
FL: Chapman & Hall/CRC (2008). 

32. Gross J, Dumsha B, Glazer N. Comparative biochemistry of collagen some amino 
acids and carbohydrates. Biochim Biophys Acta (1958) 30:293-7. 
doi:10.1016/0006-3002(58)90053-2 

33. Jones DT. Protein secondary structure prediction based on position-specific 
scoring matrices. J Mol Biol (1999) 292(2):195-202. doi:10.1006/jmbi.1999.3091 

34. McAlinden A. Helical coiled-coil oligomerization domains are almost 
ubiquitous in the collagen superfamily. J Biol Chem (2003) 278:42200-7. 
doi:10.1074/jbc.M302429200 

35. Parry DAD, Fraser RDB, Squire JM. Fifty years of coiled-coils and α-helical 
bundles: a close relationship between sequence and structure. J Struct Biol (2008) 
163:258-69. doi:10.1016/j.jsb.2008.01.016 



36. Krogh A, Larsson B, Heijne Von G, Sonnhammer E. Predicting transmembrane 
protein topology with a hidden Markov model: application to complete 
genomes. J Mol Biol (2001) 305:567-80. doi:10.1006/jmbi.2000.4315 

37. Petersen TN, Brunak S, Heijne von G, Nielsen H. SignalP 4.0: discriminating 
signal peptides from transmembrane regions. Nat Methods (2011) 
8:785-6. doi:10.1038/nmeth.1701 

38. Baechler EC, Batliwalla FM, Karypis G, Gaffney PM, Ortmann WA, Espe KJ, 
et al. Interferon-inducible gene expression signature in peripheral blood cells 
of patients with severe lupus. Proc Natl Acad Sci USA (2003) 100:2610-5. 
doi:10.1073/pnas.0337679100 

39. Foster SL, Hargreaves DC, Medzhitov R. Gene-specific control of inflamma- 
tion by TLR-induced chromatin modifications. Nature (2007) 447(7147):972-8. 
doi:10.1038/nature05836 

40. van't Veer LJ, Dai H, Van De Vijver MJ, He YD, Hart AA, Mao M, et al. Gene 
expression profiling predicts clinical outcome of breast cancer. Nature (2002) 
415:530-6. doi:10.1038/415530a 

41. Orihuela CJ, Radin JN, Sublett JE, Gao G, Kaushal D, Tuomanen EI. Microarray 
analysis of pneumococcal gene expression during invasive disease. Infect Immun 
(2004) 72:5582-96. doi:10.1128/IAI.72.10.5582-5596.2004 

42. Wang Y, Joshi T, Zhang XS, Xu D, Chen L. Inferring gene regulatory 
networks from multiple microarray datasets. Bioinformatics (2006) 22:2413-20. 
doi:10.1093/bioinformatics/btl396 

43. Koopman LA, Kopcow HD, Rybalov B, Boyson JE, Orange JS, Schatz F, 
et al. Human decidual natural killer cells are a unique NK cell subset with 
immunomodulatory potential. J Exp Med (2003) 198:1201-12. 
doi:10.1084/jem.20030305 

44. Hume DA, Mabbott N, Raza S, Freeman TC. Can DCs be distinguished 
from macrophages by molecular signatures? Nat Immunol (2013) 14:187-9. 
doi:10.1038/ni0813-876d 

45. Randolph G, Merad M. Can DCs be distinguished from macrophages by 
molecular signatures? Nat Immunol (2013) 14:189-90. doi:10.1038/ni.2517 

46. Heng TS, Painter MW, Elpek K, Lukacs-Kornek V, Mauermann N, Turley SJ, et al. 
The Immunological Genome Project: networks of gene expression in immune 
cells. Nat Immunol (2008) 9:1091-4. doi:10.1038/ni1008-1091 

47. Abbas AR, Baldwin D, Ma Y, Ouyang W, Gurney A, Martin F, et al. Immune 
response in silico (IRIS): immune-specific genes identified from a compendium 
of microarray expression data. Genes Immun (2005) 6:319-31. 
doi:10.1038/sj.gene.6364173 

48. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, et al. dbSNP: 
the NCBI database of genetic variation. Nucleic Acids Res (2001) 29:308-11. 
doi:10.1093/nar/29.1.308 

49. Ferwerda B, Alonso S, Banahan K, McCall MB, Giamarellos-Bourboulis EJ, 
Ramakers BP, et al. Functional and genetic evidence that the Mal/TIRAP allele 
variant 180L has been selected by providing protection against septic shock. Proc 
Natl Acad Sci USA (2009) 106:10272-7. doi:10.1073/pnas.0811273106 

50. Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, et al. 
The UCSC Genome Browser database: update 2011. Nucleic Acids Res (2010) 
39:D876-82. doi:10.1093/nar/gkq963 

51. Barnes MR, Gray IC. Bioinformatics for Geneticists. Hoboken, NJ: Wiley (2003). 

52. Parker KC, Bednarek MA, Coligan JE. Scheme for ranking potential HLA-A2 
binding peptides based on independent binding of individual peptide 
side-chains. J Immunol (1994) 152:163-75. 

53. Korber B, LaBute M, Yusim K. Immunoinformatics comes of age. PLoS Comput 
Biol (2006) 2:e71. doi:10.1371/journal.pcbi.0020071 

54. Tong JC, Ren EC. Immunoinformatics: current trends and future directions. 
Drug Discov Today (2009) 14:684-9. doi:10.1016/j.drudis.2009.04.001 

55. Tomar N, De RK. Immunoinformatics: an integrated scenario. Immunology 
(2010) 131:153-68. doi:10.1111/j.1365-2567.2010.03330.x 

56. Kulikova T. The EMBL nucleotide sequence database. Nucleic Acids Res (2004) 
32:27D-30D. doi:10.1093/nar/gkh120 

57. Miyazaki S. DNA Data Bank of Japan (DDBJ) in XML. Nucleic Acids Res (2003) 
31:13-6. doi:10.1093/nar/gkg088 

58. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. 
The human genome browser at UCSC. Genome Res (2002) 12:996-1006. 
doi:10.1101/gr.229102 

59. Julenius K. NetCGlyc 1.0: prediction of mammalian C-mannosylation sites. 
Glycobiology (2007) 17:868-76. doi:10.1093/glycob/cwm050 






60. Maurer-Stroh S, Eisenhaber F. Refinement and prediction of protein prenylation 
motifs. Genome Biol (2005) 6:R55. doi:10.1186/gb-2005-6-6-r55 

61. Monigatti F, Gasteiger E, Bairoch A, Jung E. The Sulfinator: predicting tyro- 
sine sulfation sites in protein sequences. Bioinformatics (2002) 18:769-70. 
doi:10.1093/bioinformatics/18.5.769 

62. Duckert P, Brunak S, Blom N. Prediction of proprotein convertase cleavage sites. 
Protein Eng Des Sel (2004) 17:107-12. doi:10.1093/protein/gzh013 

63. Radivojac P, Vacic V, Haynes C, Cocklin RR, Mohan A, Heyen JW, et al. Identifi- 
cation, analysis, and prediction of protein ubiquitination sites. Proteins (2010) 
78:365-80. doi:10.1002/prot.22555 

64. Andrei RM, Loni T, Callieri M, Zini MF, Maraziti G, Pan MC. BioBlender: A 
Software for Intuitive Representation of Surface Properties of Biomolecules. (2010). 
p. 1-19. Available at: http://cds.cern.ch/record/1294402 

65. Jmol: An Open-Source Java Viewer for Chemical Structures in 3D. Available at: 
http://www.jmol.org/ 

66. Cole C, Barber JD, Barton GJ. The Jpred 3 secondary structure prediction server. 
Nucleic Acids Res (2008) 36:W197-201. doi:10.1093/nar/gkn238 

67. Chou P, Fasman G. Prediction of protein conformation. Biochemistry (1974) 
13:222-45. doi:10.1021/bi00699a002 

68. Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res (2002) 12:656-64. 
doi:10.1101/gr.229202 

69. Fredman D, Siegfried M, Yuan YP, Bork P, Lehvaslaiho H, Brookes AJ. HGVbase: 
a human sequence variation database emphasizing data quality and a broad 
spectrum of data sources. Nucleic Acids Res (2002) 30:387-91. 
doi:10.1093/nar/30.1.387 

70. Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, et al. Ensembl 2013. 
Nucleic Acids Res (2012) 41:D48-55. doi:10.1093/nar/gks1236 

71. Gibbs RA, Belmont JW, Hardenbol P, Willis TD, Yu F, Yang H, et al. 
The international HapMap project. Nature (2003) 426:789-96. 
doi:10.1038/nature02168 

72. Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, 
et al. An integrated map of genetic variation from 1,092 human genomes. Nature 
(2012) 490:56-65. doi:10.1038/nature11632 

73. Barrett JC, Fry B, Mailer J, Daly MJ. Haploview: analysis and visualization 
of LD and haplotype maps. Bioinformatics (2005) 21:263-5. 
doi:10.1093/bioinformatics/bth457 

Conflict of Interest Statement: The authors declare that the research was conducted 
in the absence of any commercial or financial relationships that could be construed 
as a potential conflict of interest. 

Received: 10 September 2013; accepted: 13 November 2013; published online: 04 
December 2013. 

Citation: Whelan FJ, Yap NVL, Surette MG, Golding GB and Bowdish DME 
(2013) A guide to bioinformatics for immunologists. Front. Immunol. 4:416. doi: 
10.3389/fimmu.2013.00416 

This article was submitted to Molecular Innate Immunity, a section of the journal 
Frontiers in Immunology. 

Copyright © 2013 Whelan, Yap, Surette, Golding and Bowdish. This is an open-access 
article distributed under the terms of the Creative Commons Attribution License (CC 
BY). The use, distribution or reproduction in other forums is permitted, provided the 
original author(s) or licensor are credited and that the original publication in this 
journal is cited, in accordance with accepted academic practice. No use, distribution or 
reproduction is permitted which does not comply with these terms. 






